
Public Library of Science (PLOS)

Targeted Assembly of Short Sequence Reads

Author: H Li
JD Freeman
JT Simpson
LD Stein
M Rasmussen
Olivier Lespinet
R Goya
R Li
R Morin
René L. Warren
RK Nam
RL Warren
RM Durbin
Robert A. Holt
S Nacu
SP Shah
WR Jeck
Publication date: 01/01/2011

As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants, by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphism, fusion and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy-to-use tool for targeted assembly.
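The read-recruitment step described in this abstract can be illustrated with a minimal Python sketch. It is not TASR's actual code: the function names, the word length `k`, and the decision to index both strands of the target are assumptions made for illustration, and the stringent assembly step that follows recruitment is not shown.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def target_words(target, k):
    """Collect every k-length word on both strands of the target (both strands is an assumption)."""
    target = target.upper()
    words = set()
    for strand in (target, reverse_complement(target)):
        for i in range(len(strand) - k + 1):
            words.add(strand[i:i + k])
    return words


def recruit_reads(reads, target, k=15):
    """Keep reads that share at least one exact k-length word with the target."""
    words = target_words(target, k)
    recruited = []
    for read in reads:
        read_upper = read.upper()
        if any(read_upper[i:i + k] in words
               for i in range(len(read_upper) - k + 1)):
            recruited.append(read)
    return recruited


# Hypothetical toy data: one read overlapping the target, one unrelated read,
# and one read overlapping the target on the opposite strand.
target = "ACGTTGCAGGCTAAGCTTGACCGTAGGCTA"
reads = [
    "TTGCAGGCTAAGCTTGACCG",
    "GGGGGGGGGGGGGGGGGGGG",
    reverse_complement("GCTTGACCGTAGGCTA"),
]
hits = recruit_reads(reads, target, k=12)
```

Storing the target words in a set keeps each lookup constant-time, so the cost scales with the total length of the reads rather than with the size of a whole-genome reference.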

CiteSeerX

Simon Fraser University Institutional Repository

Nature Precedings

Propensity score analysis in the Genetic Analysis Workshop 17 simulated data set on independent individuals

Author: Berit Kerner
Chen Min Lin
CK Wu
EA Stuart
Fah J Sathirapongsasuti
H Zhao
LA Almasy
PR Rosenbaum
RM Durbin
S Guo
S Weitzen
Publication venue: BioMed Central
Publication date: 01/01/2011

Genetic Analysis Workshop 17 provided simulated phenotypes and exome sequence data for 697 independent individuals (209 case subjects and 488 control subjects). The disease liability in these data was influenced by multiple quantitative traits. We addressed the lack of statistical power in this small data set by limiting the genomic variants included in the study to those with potential disease-causing effect, thereby reducing the problem of multiple testing. After this adjustment, we could readily detect two common variants that were strongly associated with the quantitative trait Q1 (C13S523 and C13S522). However, we found no significant associations with the affected status or with any of the other quantitative traits, and the relationship between disease status and genomic variants remained obscure. To address the challenge of the multivariate phenotype, we used propensity scores to combine covariates with genetic risk factors into a single risk factor and created a new phenotype variable, the probability of being affected given the covariates. Using the propensity score as a quantitative trait in the case-control analysis, we again could identify the two common single-nucleotide polymorphisms (C13S523 and C13S522). In addition, this analysis captured the correlation between Q1 and the affected status and reduced the problem of multiple testing. Although the propensity score was useful for capturing and clarifying the genetic contributions of common variants to the disease phenotype and the mediating role of the quantitative trait Q1, the analysis did not increase power to detect rare variants.

Springer - Publisher Connector

eScholarship - University of California

ENGINES: exploring single nucleotide variation in entire human genomes

Author: Antonio Salas
Christopher Phillips
E Peacock
J Amigo
JK Pritchard
JM Akey
Jorge Amigo
JZ Li
L Excoffier
RC Lewontin
RM Durbin
The International HapMap Consortium
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Next-generation ultra-sequencing technologies are starting to produce extensive quantities of data from entire human genome or exome sequences, and therefore new software is needed to present and analyse this vast amount of information. The 1000 Genomes project has recently released raw data for 629 complete genomes representing several human populations through their Phase I interim analysis and, although there are certain public tools available that allow exploration of these genomes, to date there is no tool that permits comprehensive population analysis of the variation catalogued by such data. Description: We have developed a genetic variant site explorer able to retrieve data for Single Nucleotide Variations (SNVs), population by population, from entire genomes without compromising future scalability and agility. ENGINES (ENtire Genome INterface for Exploring SNVs) uses data from the 1000 Genomes Phase I to demonstrate its capacity to handle large amounts of genetic variation (>7.3 billion genotypes and 28 million SNVs), as well as deriving summary statistics of interest for medical and population genetics applications. The whole dataset is pre-processed and summarized into a data mart accessible through a web interface. The query system allows the combination and comparison of each available population sample, while searching by rs-number list, chromosome region, or genes of interest. Frequency and FST filters are available to further refine queries, while results can be visually compared with other large-scale Single Nucleotide Polymorphism (SNP) repositories such as HapMap or Perlegen. Conclusions: ENGINES is capable of accessing large-scale variation data repositories in a fast and comprehensive manner. It allows quick browsing of whole genome variation, while providing statistical information for each variant site such as allele frequency, heterozygosity or FST values for genetic differentiation. Access to the data mart generating scripts and to the web interface is granted from http://spsmart.cesga.es/engines.php.

Springer - Publisher Connector

Repositorio Institucional da Universidade de Santiago de Compostela

Identification of genes associated with complex traits by testing the genetic dissimilarity between individuals

Author: BFJ Manly
C Dering
J Thioulouse
Kerby A Shedden
L Beckmann
LA Almasy
MJ Fortin
N Mantel
RM Durbin
Sharon LR Kardia
WD Shannon
Wei Zhao
Yan V Sun
Publication venue: BioMed Central
Publication date: 01/01/2011

Using the exome sequencing data from 697 unrelated individuals and their simulated disease phenotypes from Genetic Analysis Workshop 17, we develop and apply a gene-based method to identify the relationship between a gene with multiple rare genetic variants and a phenotype. The method is based on the Mantel test, which assesses the correlation between two distance matrices using a permutation procedure. Using up to 100,000 permutations to estimate the statistical significance in 200 replicate data sets, we found that the method had 5.1% type I error at an α level of 0.05 and had varying power to detect genes with simulated genetic associations. FLT1 and KDR had the most significant correlations with Q1 and were replicated 170 and 24 times, respectively, in 200 simulated data sets using a Bonferroni-corrected p-value of 0.05 as a threshold. These results suggest that the distance correlation method can be used to identify genotype-phenotype association when multiple rare genetic variants in a gene are involved.
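The Mantel test named in this abstract can be sketched generically in a few lines of Python. This is a textbook version under stated assumptions, not the authors' gene-based method: the function names, the use of Pearson correlation, the one-sided p-value, and the toy data are all illustrative choices.

```python
import random
from math import sqrt


def upper_triangle(matrix, order):
    """Flatten the upper triangle of a square matrix after relabelling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    r_obs = pearson(flat_a, upper_triangle(dist_b, identity))
    exceed = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of dist_b together
        if pearson(flat_a, upper_triangle(dist_b, order)) >= r_obs:
            exceed += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return r_obs, (exceed + 1) / (n_perm + 1)


# Hypothetical toy data: six individuals with a genotype score and a phenotype value.
genotype = [0, 1, 2, 3, 4, 5]
phenotype = [0.1, 1.2, 1.9, 3.3, 4.1, 4.8]
geno_dist = [[abs(a - b) for b in genotype] for a in genotype]
pheno_dist = [[abs(a - b) for b in phenotype] for a in phenotype]
r_obs, p_value = mantel_test(geno_dist, pheno_dist)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations, so shuffling individual cells would not give a valid null distribution.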

Springer - Publisher Connector

Deep Blue Documents at the University of Michigan

Reply: In vitro and in vivo anticancer efficacy of unconjugated humanised anti-CEA monoclonal antibodies

Author: DM Goldenberg
F Nimmerjahn
G Hajjar
H Durbin
J L Wilding
JM Reichert
JY Wong
LM Stewart
M Granowska
P J Conaghan
PJ Carter
PJ Conaghan
R Clynes
RD Blumenthal
RM Sharkey
RW Wilkinson
S Q Ashraf
T Liersch
W F Bodmer
Publication venue: Nature Publishing Group
Publication date: 01/01/2008

Oxford University Research Archive

Rare variant density across the genome and across populations

Author: A Ramírez-Soriano
AD Gika
AP Morris
CS Carlson
E Shtivelman
F Tajima
FD Ciccarelli
LA Almasy
M Kimura
MJ Hall
Paola Raska
RM Durbin
V Apanius
V Bansal
Xiaofeng Zhu
Publication venue: Springer Science and Business Media LLC

Integrative Genomics Viewer

Author: C Nielsen
Eric S Lander
Gad Getz
H Bao
Helga Thorvaldsdóttir
I Milne
James T Robinson
Jill P Mesirov
JW Nicol
K Rutherford
M Fiume
M Guttman
MF Berger
Mitchell Guttman
RG Verhaak
RM Durbin
W Huang
Wendy Winckler
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

Author Manuscript 2012 May 07.

To the Editor: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

National Institute of General Medical Sciences (U.S.) (R01GM074024); National Cancer Institute (U.S.) (R21CA135827); National Human Genome Research Institute (U.S.) (U54HG003067)

Abundant Human DNA Contamination Identified in Non-Primate Genome Databases

Author: AM Waterhouse
GE Liu
GN Rutty
H Malmstrom
HN Poinar
J Jurka
MA Larkin
Mark S. Longo
Michael J. O'Neill
Najib El-Sayed
NC Kyrpides
PL Deininger
Rachel J. O'Neill
RM Durbin
SF Altschul
TJ Katz
WJ Kent
Publication venue: Public Library of Science
Publication date: 01/01/2010

During routine screens of the NCBI databases using human repetitive elements we discovered an unlikely level of nucleotide identity across a broad range of phyla. To ascertain whether databases containing DNA sequences, genome assemblies and trace archive reads were contaminated with human sequences, we performed an in-depth search for sequences of human origin in non-human species. Using a primate-specific SINE, AluY, we screened 2,749 non-primate public databases from NCBI, Ensembl, JGI, and UCSC and have found 492 to be contaminated with human sequence. These represent species ranging from bacteria (B. cereus) to plants (Z. mays) to fish (D. rerio) with examples found from most phyla. The identification of such extensive contamination of human sequence across databases and sequence types warrants caution among the sequencing community in future sequencing efforts, such as human re-sequencing. We discuss issues this may raise as well as present data that give insight into how this may be occurring.

CiteSeerX

Public Library of Science (PLOS)

SNPs Occur in Regions with Less Genomic Sequence Conservation

Author: A Grimson
A Siepel
BP Lewis
CF Baer
CM Wade
D Chasman
E Pennisi
ES Lander
FW Allendorf
GE Crooks
H Zhang
Ilya Ruvinsky
J Stapley
JC Venter
John C. Castle
JV Chamary
K Chen
L Cartegni
M Lynch
M Stratton
MA Saunders
MP Miller
PA Morin
RH Waterston
RM Durbin
RM Kuhn
ST Sherry
V Matys
WG Fairbrother
Publication venue: Public Library of Science
Publication date: 06/06/2011

Rates of SNPs (single nucleotide polymorphisms) and cross-species genomic sequence conservation reflect intra- and inter-species variation, respectively. Here, I report SNP rates and genomic sequence conservation adjacent to mRNA processing regions and show that, as expected, more SNPs occur in less conserved regions and that functional regions have fewer SNPs. Results are confirmed using both mouse and human data. Regions include protein start codons, 3′ splice sites, 5′ splice sites, protein stop codons, predicted miRNA binding sites, and polyadenylation sites. Throughout, SNP rates are lower and conservation is higher at regulatory sites. Within coding regions, SNP rates are highest and conservation is lowest at codon position three and the fewest SNPs are found at codon position two, reflecting codon degeneracy for amino acid encoding. Exon splice sites show high conservation and very low SNP rates, reflecting both splicing signals and protein coding. Relaxed constraint on the codon third position is dramatically seen when separating exonic SNP rates based on intron phase. At polyadenylation sites, a peak of conservation and low SNP rate occurs from 30 to 17 nt preceding the site. This region is highly enriched for the sequence AAUAAA, reflecting the location of the conserved polyA signal. miRNA 3′ UTR target sites are predicted by incorporating interspecies genomic sequence conservation; SNP rates are low in these sites, again showing fewer SNPs in conserved regions. Together, these results confirm that SNPs, reflecting recent genetic variation, occur more frequently in regions with less evolutionary conservation.